Clustering and classification

Overview of the data

This week I am using a built-in dataset, namely the Boston dataset which contains housing information in the Boston Mass area. The data has been collected by the U.S Census Service and it is also available online.

The Boston dataset consists of just 506 observations and there are 14 variables. All of the variables are continuous. Let us take a closer look at what they are.

The variable names are not exactly self-explanatory. Because our analysis will again depend upon them, I will briefly list them here with a short description.

Variable Description
crim crime rate (per capita)
zn proportion of residential land zoned for lots over 25,000 sq.ft.
indus proportion of non-retail business acres
chas Charles River dummy variable (1 if tract bounds river; 0 otherwise)
nox nitric oxides concentration
rm average number of rooms per dwelling
age proportion of owner-occupied units built prior to 1940
dis weighted distances to five Boston employment centres
rad index of accessibility to radial highways
tax full-value property-tax rate per $10,000
ptratio pupil-teacher ratio
black proportion of blacks
lstat % lower status of the population
medv median value of owner-occupied homes in $1000’s

Thus, the Boston dataset consists of a variety of information: demographic, economic, and environmental factors as well as safety. I will not discuss the variables in more detail just now, but save it for later when we explore the standardized dataset - the comparison between the original and scaled variables will be more meaningful if the summaries are presented together.

So, let us next take a graphical tour of the data. In the figure below, each variable is plotted against the other variables.

Here we see some interesting distribution patterns, such as
* accumulation close to the edges and corners of the box - e.g. age/lstat
* more curved shapes close to the lower left corner - e.g. nox/dis
* compact round shapes close to one of the corners - e.g. rad/tax
* binary positionas in all pairs for variables chas and rad.

Another way of approaching the relationships between variables is to look at their correlations. Below is a correlogram where, instead of numerical values, the correlations are presented as coloured spots. Red colour indicates a negative correlation and blue colour indicates a posititve correlation. The stronger the correlation, the bigger the spot.

Three pairs stand out as having a strong negative correlation: indus/dis, nox/dis and lstat/medv. It is interesting to note the connection between this and the previous graph. If we go back to the previous graph, we can see that these pairs are those which showed curved patterns.
As for positive correlations, there are many to choose from but one stands out from the rest: rad/tax. This was also identified above and it is the one with a compact round shape.

Standardization

The first step before attempting actual classification is to standardize the dataset. This is because we need the variables both to be comparable and to fit assumptions on which the classification method is based on (variables are normally distributed and their variance is the same).

Standardization is a simple operation in which the values of each variable are calculated using the following formula:
\(scaled(x) = [x-mean(x)]/sd(x)\)
i.e. the variable mean is substracted from the original value of the variable and the resulting difference is then divided by the standard deviation of the variable.

Now let us take a closer look at a selection of these scaled variables and compare them to the original, non-scaled, dataset (original values are shown first and then the scaled values). Here are summaries of age

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     2.9    45.0    77.5    68.6    94.1   100.0
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -2.330  -0.837   0.317   0.000   0.906   1.120

tax (full value property tax rate)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     187     279     330     408     666     711
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -1.310  -0.767  -0.464   0.000   1.530   1.800

and black (proportion of blacks in the area)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0     375     391     357     396     397
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -3.90    0.20    0.38    0.00    0.43    0.44

The standardization procedure has affected the variables considerably. To begin with, the range has been radically reduced: for age [2.9,100.0] -> [-2.333,1.116], for tax [187,711] -> [-1.313,1.796], and for black [0,397] -> [-3.90, 0.44]. What is most noteworthy, though, is the change in the mean values - in the scaled dataset, the mean value is 0.000 for all variables. The variables are not aligned and thus more easily comparable.

I will use this scaled dataset from this point onwards. There are still some minor adjustments to be made. In the analysis to come, I will use crime rate as the target variable, and in order to do so the continuous variable crim needs to be converted into a new, categorical variable crime (after which the original variable will be removed, because all the other variables will serve as the explanatory variables). The quantiles of the scaled variable crim

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -0.42   -0.41   -0.39    0.00    0.01    9.92

will serve as the breakpoints for the new categorical variables crime - the table below shows its distribution between the newly ceated categories.

## crime
##      low  med_low med_high     high 
##      127      126      126      127

Creating a model from your data is interesting by itself, but it falls short if there is nothing to test it on. Luckily, we can divide the dataset into training and test sets where the former is used for creating the model and the latter is used as new data for testing the model’s prediction power. I will use 80% of my data for training and the remaining 20% for testing.

Linear discriminant analysis

Now that the data has been prepared and divided, we can fit the linear discriminant analysis (LDA) on the previously created training set. LDA is a classification method where the classes are known beforehand. In our case they are the previously created four categories denoting different crime levels.

LDA finds a linear combination of the explanatory variables which best characterizes or separates the different classes of the target variable. In this current case, the categorical variable crime is the target variable and all other variables are the explanatory, or predictor, variables. Here is a scatterplot of the LDA where different colours represent the different classes. The length and direction of the arrows show the impact that each of the predictor variables has in the model.

It appears that accessibility to radial highways (rad) is associated with a high crime rate (in blue) and this relationship is more prominent than those between other predictors and lower crime rates. Although the medium high crime rate (in green) is almost completely separated from the high crime rate, it is somewhat a more uniform class than the lower crime rates (in black and red) and has an association with environmental factors (nox).

Prediction

Next, we can use the test set for predicting the classes in this new dataset and thus assess the model’s perfomance. Here is a cross tabulation of the actual vs predicted values.

##           predicted
## correct    low med_low med_high high Sum
##   low       15      11        0    0  26
##   med_low    8      14        4    0  26
##   med_high   0       6       10    2  18
##   high       0       0        0   32  32
##   Sum       23      31       14   34 102

Let us begin with the high crime rate: all predictions for this class are correct ones. Medium high crime rate is also most often predicted correctly. On the other hand, the predictions for low and medium low crime rates are not as accurate. These results confirm what we saw above in connection with the biplot.

K-means clustering

The aim is to create clusters (i.e. groups) of the observations. This can be done by adopting the k-means method which assigns the observations into k clusters based on the similarity of the objects. Similarity can in turn be measured in terms of distances. The following analysis is based on the Euclidean measure. Below is a summary of those distances.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.13    3.46    4.82    4.91    6.19   14.40

Now we can run the k-means algorithm. Because I do not yet know how many cluster I should use, I will begin by setting them at 10 which should be a relatively safe bet.

Fortunately, we do not need to rely on guessing as there are more sophisticated ways of finding the optimal number of clusters. Here, I will use total of within cluster sum of squares (WCSS) to do so. WCSS calculates the number of clusters for which the observations are closest to the center of the cluster. The optimal number is the one where to value drops significantly. In the line graph below we can see that the drop occurs where the number of clusters is two.

I will now run the k-means algorithm again with the clusters set at two. The figure below is a similar to the representation of pairs we saw earlier, but here it isthe scaled variables that are plotted against each other. Also, instead of the original values, the patterns seen here represent the similarities (or differences), i.e. distances, between the variables.

Disregarding the colours for a brief moment, the overall distribution patters appear nearly identical to those we saw earlier. However, now that clusters are included and shown in different colours, we can see how the patterns relate to the clusters. The differences between the clusters is not always clear, but for very main pairs they reveal certain uniqua, or even binary patterns. The differences are most clear for all instances of the target variable crim (crime rate) and chas (dummy variable), but also for rad (accessibility to radial highways) and tax (full-value property-tax rate).

3D plot

The icing on the cake is the following 3D plot. Not only does it look really good, but you can move it around as well. It shows the same four classes as in the LDA plot seen earlier, and here too the high crime rate (in yellow) stands apart from the other crime levels.